Internet Message Format | 2000-09-20 | 5KB
Return-Path: <icon-group-sender>
Received: from kingfisher.CS.Arizona.EDU (kingfisher.CS.Arizona.EDU [192.12.69.239])
by baskerville.CS.Arizona.EDU (8.9.1a/8.9.1) with SMTP id QAA10662
for <icon-group-addresses@baskerville.CS.Arizona.EDU>; Wed, 16 Sep 1998 16:46:55 -0700 (MST)
Received: by kingfisher.CS.Arizona.EDU (5.65v4.0/1.1.8.2/08Nov94-0446PM)
id AA05606; Wed, 16 Sep 1998 16:46:26 -0700
To: icon-group@optima.CS.Arizona.EDU
Date: 16 Sep 1998 15:27:22 -0400
From: richard@goon.stg.brown.edu (Richard L. Goerwitz III)
Message-Id: <6tp3eq$buu@goon.stg.brown.edu>
Organization: Brown University Scholarly Technology Group
Sender: icon-group-request@optima.CS.Arizona.EDU
References: <9809120038.AA13184@hawk.CS.Arizona.EDU>
Subject: Re: Unicode support or support for non-ASCII based character ma
Errors-To: icon-group-errors@optima.CS.Arizona.EDU
Status: RO
Re Unicode, Gregg Townsend <gmt@baskerville.CS.Arizona.EDU> wrote:
> -- In Unicode, there aren't just 26 lower-case letters, and
> they're not all contiguous. What should &lcase contain?
> How would this affect existing programs?
It's funny that this issue has reared its ugly head again. I
guess it was four or five years ago that I advocated moving to
Unicode (the logic was that Icon was a good string processing
language, and that it was kind of silly to confine it to an
eight-bit universe).
The problem with doing this back then was one of resources (the
Icon Project was winding down, and graphics had become the main
concern). Users had some good spats over Unicode, but ultimately
nobody had any resources to bring it off - and some prominent
members of the Icon community denied that Unicode would ever
become a prominent standard (aside: sixteen-bit characters are now
the default for NT; Java also works on the same assumption; and
Unicode is the core format for XML, too).
Anyway, aside from me annoying everyone with my Unicode "mantra"
(as Clint called it once), the issue pretty much went away.
To the questions you raised (re lowercase letters, effect on
existing programs):
The number of alphabets with case distinctions is finite, and in
fact rather low (Latin, Greek, Coptic, Armenian, and Cyrillic).
So you define everything (except specifically uppercase letters
in these scripts) to be lowercase. Then you define everything
(except specifically lowercase letters in these scripts) to be
uppercase.
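
One literal reading of that definition, sketched in Python rather
than Icon purely for illustration. The names is_lcase and is_ucase
are hypothetical, and treating the Unicode general categories "Lu"
and "Ll" as "specifically uppercase" and "specifically lowercase"
is an assumption of this sketch, not something Icon defines.

```python
import unicodedata

def is_lcase(ch):
    # Hypothetical membership test for a Unicode-aware &lcase:
    # anything that is not specifically an uppercase letter.
    return unicodedata.category(ch) != "Lu"

def is_ucase(ch):
    # Hypothetical membership test for a Unicode-aware &ucase:
    # anything that is not specifically a lowercase letter.
    return unicodedata.category(ch) != "Ll"

# Latin and Greek letters plus a digit, to show how each is classified.
for ch in "aAαΑ7":
    print(repr(ch), is_lcase(ch), is_ucase(ch))
```

Under this reading, caseless characters such as digits land in
both sets, which is one of the things existing programs would have
to be checked against.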
Note that in some languages the number of lowercase letters
exceeds the number of uppercase letters (and the mappings are not
one-to-one). But in most cases, reasonable equivalents can be
conjured up.
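
A small illustration of mappings that are not one-to-one, using
Python's built-in case conversion as a stand-in (Icon has nothing
equivalent for Unicode; round_trips is a hypothetical helper). The
German sharp s and the two Greek lowercase sigmas are the examples
assumed here.

```python
def round_trips(ch):
    # True if uppercasing and then lowercasing gives back the
    # original character, i.e. the mapping is one-to-one for ch.
    return ch.upper().lower() == ch

# Sharp s uppercases to two characters in Python's mapping.
print(len("ß"), len("ß".upper()))

# Medial and final sigma share a single uppercase letter.
print("σ".upper() == "ς".upper())

for ch in "aßς":
    print(repr(ch), round_trips(ch))
```

Any "reasonable equivalent" scheme has to pick a policy for the
characters where round_trips reports False.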
I'd be happy to contribute code, if it comes to that.
As for how this would affect existing programs: Heavily. The
assumption of Unicode characters really screws up everything. Not
only do you have to worry about upper/lowercase mapping, but you
also have to think about (as GT notes above) I/O.
Some ideas:
0) data and character streams should be read using different
functions
1) user should be permitted to select a (default) input format
(e.g., ISO 8859-1, UTF-8, UTF-16, etc.) for character
stream readers
2) user should be permitted to select a (default) output format
(e.g., ISO 8859-1, UTF-8, UTF-16, etc.) for character
stream writers
3) all internal strings must be represented as UCS-2 (or UCS-4,
if you don't care about memory); you can't use UTF-8, because
multi-byte variable-length characters are no good for any
code that relies on fixed-width characters
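
To make the fixed-width point in (3) concrete, here is a sketch in
Python (not Icon) that indexes the same text stored two ways. The
helper names char_at_fixed and char_at_utf8 are hypothetical, and
UTF-32 stands in for UCS-4.

```python
text = "naïve Ω"

utf8 = text.encode("utf-8")       # one or more bytes per character
ucs4 = text.encode("utf-32-le")   # always four bytes per character, no BOM

def char_at_fixed(buf, i, width=4):
    # Fixed-width storage: character i sits at a known byte offset.
    return buf[i * width:(i + 1) * width].decode("utf-32-le")

def char_at_utf8(buf, i):
    # Variable-length storage: scan from the start, counting lead
    # bytes (anything not of the form 10xxxxxx) to find character i.
    count = -1
    start = None
    for pos, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:
            count += 1
            if count == i:
                start = pos
            elif count == i + 1:
                return buf[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return buf[start:].decode("utf-8")

print(len(text), len(utf8), len(ucs4))
print(char_at_fixed(ucs4, 2), char_at_utf8(utf8, 2))
```

Both helpers return the same character, but the UTF-8 version has
to walk the buffer, which is the cost that subscripting and string
scanning would pay on every access.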
For most programs, the user would just select ISO 8859 as the
default character input and output format.
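
Ideas (0) through (2) might look roughly like this, again sketched
in Python rather than Icon. Every name here (DEFAULT_ENCODING,
read_data, read_chars, write_chars) is hypothetical, and ISO 8859-1
as the default is an assumption taken from the examples in the
list.

```python
import io

DEFAULT_ENCODING = "iso-8859-1"   # hypothetical program-wide default

def read_data(stream):
    # Data stream: raw bytes, no interpretation at all.
    return stream.read()

def read_chars(stream, encoding=DEFAULT_ENCODING):
    # Character stream: bytes are decoded into characters on input.
    return stream.read().decode(encoding)

def write_chars(stream, text, encoding=DEFAULT_ENCODING):
    # Character stream: characters are encoded into bytes on output.
    stream.write(text.encode(encoding))

# The same four characters arriving in two different encodings.
print(read_chars(io.BytesIO(b"caf\xe9")))
print(read_chars(io.BytesIO(b"caf\xc3\xa9"), "utf-8"))
```

A program that never changes the default keeps its old eight-bit
behavior, which is what most existing programs would want.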
The whole thing is a mess, and my guess is that unless there is
a new source of funding for Icon - and an effort to put in a
ground-up rewrite - it would not be feasible to rearrange
everything to support Unicode.
If it does happen, I suspect that it will happen in a descendant
language that incorporates fundamental features like networking.
Two years ago or so I saw some excellent networking extensions to
Icon posted here, incidentally. Seemed to me they gave Icon a
fighting chance.
What ever became of them?
One other thing to think about: What will the PERL community do
on the Unicode/XML front? Keep your eyes peeled.
--
Richard Goerwitz
PGP key fingerprint: C1 3E F4 23 7C 33 51 8D 3B 88 53 57 56 0D 38 A0
For more info (mail, phone, fax no.): finger richard@goon.stg.brown.edu